Introduction¶

In this notebook, we will do a comprehensive analysis of the Android app market by comparing thousands of apps in the Google Play store.

About the Dataset of Google Play Store Apps & Reviews¶

Data Source:
App and review data was scraped from the Google Play Store by Lavanya Gupta in 2018. Original files listed here.

Import Statements¶

In [1]:
# !pip install plotly
import pandas as pd
import plotly.express as px

Notebook Presentation¶

In [2]:
# Show numeric output in decimal format e.g., 2.15
pd.options.display.float_format = '{:,.2f}'.format

Read the Dataset¶

In [3]:
df_apps = pd.read_csv('apps.csv')

Data Cleaning¶

Challenge: How many rows and columns does df_apps have? What are the column names? Look at a random sample of 5 different rows with .sample().

In [4]:
df_apps.sample()
Out[4]:
App Category Rating Reviews Size_MBs Installs Type Price Content_Rating Genres Last_Updated Android_Ver
4212 iBP Blood Pressure MEDICAL 4.40 578 0.69 10,000 Paid $0.99 Everyone Medical November 30, 2014 2.2 and up
In [5]:
df_apps.shape
Out[5]:
(10841, 12)
In [6]:
df_apps.head()
Out[6]:
App Category Rating Reviews Size_MBs Installs Type Price Content_Rating Genres Last_Updated Android_Ver
0 Ak Parti Yardım Toplama SOCIAL NaN 0 8.70 0 Paid $13.99 Teen Social July 28, 2017 4.1 and up
1 Ain Arabic Kids Alif Ba ta FAMILY NaN 0 33.00 0 Paid $2.99 Everyone Education April 15, 2016 3.0 and up
2 Popsicle Launcher for Android P 9.0 launcher PERSONALIZATION NaN 0 5.50 0 Paid $1.49 Everyone Personalization July 11, 2018 4.2 and up
3 Command & Conquer: Rivals FAMILY NaN 0 19.00 0 NaN 0 Everyone 10+ Strategy June 28, 2018 Varies with device
4 CX Network BUSINESS NaN 0 10.00 0 Free 0 Everyone Business August 6, 2018 4.1 and up

Drop Unused Columns¶

Challenge: Remove the columns called Last_Updated and Android_Version from the DataFrame. We will not use these columns.

In [7]:
df_apps.drop(['Android_Ver', 'Last_Updated'], axis=1, inplace=True)
In [8]:
df_apps.head()
Out[8]:
App Category Rating Reviews Size_MBs Installs Type Price Content_Rating Genres
0 Ak Parti Yardım Toplama SOCIAL NaN 0 8.70 0 Paid $13.99 Teen Social
1 Ain Arabic Kids Alif Ba ta FAMILY NaN 0 33.00 0 Paid $2.99 Everyone Education
2 Popsicle Launcher for Android P 9.0 launcher PERSONALIZATION NaN 0 5.50 0 Paid $1.49 Everyone Personalization
3 Command & Conquer: Rivals FAMILY NaN 0 19.00 0 NaN 0 Everyone 10+ Strategy
4 CX Network BUSINESS NaN 0 10.00 0 Free 0 Everyone Business

Find and Remove NaN values in Ratings¶

Challenge: How may rows have a NaN value (not-a-number) in the Ratings column? Create DataFrame called df_apps_clean that does not include these rows.

In [9]:
nan_objects = df_apps[df_apps.Rating.isna()]
nan_objects.head()
Out[9]:
App Category Rating Reviews Size_MBs Installs Type Price Content_Rating Genres
0 Ak Parti Yardım Toplama SOCIAL NaN 0 8.70 0 Paid $13.99 Teen Social
1 Ain Arabic Kids Alif Ba ta FAMILY NaN 0 33.00 0 Paid $2.99 Everyone Education
2 Popsicle Launcher for Android P 9.0 launcher PERSONALIZATION NaN 0 5.50 0 Paid $1.49 Everyone Personalization
3 Command & Conquer: Rivals FAMILY NaN 0 19.00 0 NaN 0 Everyone 10+ Strategy
4 CX Network BUSINESS NaN 0 10.00 0 Free 0 Everyone Business
In [10]:
df_apps_clean = df_apps.dropna()
In [11]:
df_apps.shape
Out[11]:
(10841, 10)
In [12]:
df_apps_clean.shape
Out[12]:
(9367, 10)
In [13]:
nan_objects.shape
Out[13]:
(1474, 10)

Find and Remove Duplicates¶

Challenge: Are there any duplicates in data? Check for duplicates using the .duplicated() function. How many entries can you find for the "Instagram" app? Use .drop_duplicates() to remove any duplicates from df_apps_clean.

In [14]:
df_apps_clean.duplicated()
Out[14]:
21       False
28       False
47       False
82       False
99       False
         ...  
10836    False
10837    False
10838    False
10839     True
10840    False
Length: 9367, dtype: bool
In [15]:
duplicated_rows = df_apps_clean[df_apps_clean.duplicated()]
print(duplicated_rows.shape)
duplicated_rows.head()
(476, 10)
Out[15]:
App Category Rating Reviews Size_MBs Installs Type Price Content_Rating Genres
946 420 BZ Budeze Delivery MEDICAL 5.00 2 11.00 100 Free 0 Mature 17+ Medical
1133 MouseMingle DATING 2.70 3 3.90 100 Free 0 Mature 17+ Dating
1196 Cardiac diagnosis (heart rate, arrhythmia) MEDICAL 4.40 8 6.50 100 Paid $12.99 Everyone Medical
1231 Sway Medical MEDICAL 5.00 3 22.00 100 Free 0 Everyone Medical
1247 Chat Kids - Chat Room For Kids DATING 4.70 6 4.90 100 Free 0 Mature 17+ Dating
In [16]:
df_apps_clean = df_apps_clean.drop_duplicates(subset=['App', 'Type', 'Price'])
df_apps_clean[df_apps_clean.App == 'Instagram']
Out[16]:
App Category Rating Reviews Size_MBs Installs Type Price Content_Rating Genres
10806 Instagram SOCIAL 4.50 66577313 5.30 1,000,000,000 Free 0 Teen Social
In [17]:
df_apps_clean.shape
Out[17]:
(8199, 10)

Find Highest Rated Apps¶

Challenge: Identify which apps are the highest rated. What problem might you encounter if you rely exclusively on ratings alone to determine the quality of an app?

In [18]:
df_apps_clean.sort_values('Rating', ascending=False)
Out[18]:
App Category Rating Reviews Size_MBs Installs Type Price Content_Rating Genres
21 KBA-EZ Health Guide MEDICAL 5.00 4 25.00 1 Free 0 Everyone Medical
1230 Sway Medical MEDICAL 5.00 3 22.00 100 Free 0 Everyone Medical
1227 AJ Men's Grooming LIFESTYLE 5.00 2 22.00 100 Free 0 Everyone Lifestyle
1224 FK Dedinje BGD SPORTS 5.00 36 2.60 100 Free 0 Everyone Sports
1223 CB VIDEO VISION PHOTOGRAPHY 5.00 13 2.60 100 Free 0 Everyone Photography
... ... ... ... ... ... ... ... ... ... ...
1314 CR Magazine BUSINESS 1.00 1 7.80 100 Free 0 Everyone Business
1932 FE Mechanical Engineering Prep FAMILY 1.00 2 21.00 1,000 Free 0 Everyone Education
357 Speech Therapy: F FAMILY 1.00 1 16.00 10 Paid $2.99 Everyone Education
818 Familial Hypercholesterolaemia Handbook MEDICAL 1.00 2 33.00 100 Free 0 Everyone Medical
1208 Thistletown CI PRODUCTIVITY 1.00 1 6.60 100 Free 0 Everyone Productivity

8199 rows × 10 columns

Find 5 Largest Apps in terms of Size (MBs)¶

Challenge: What's the size in megabytes (MB) of the largest Android apps in the Google Play Store. Based on the data, do you think there could be limit in place or can developers make apps as large as they please?

In [19]:
df_apps_clean.sort_values('Size_MBs', ascending=False)
Out[19]:
App Category Rating Reviews Size_MBs Installs Type Price Content_Rating Genres
9942 Talking Babsy Baby: Baby Games LIFESTYLE 4.00 140995 100.00 10,000,000 Free 0 Everyone Lifestyle;Pretend Play
10687 Hungry Shark Evolution GAME 4.50 6074334 100.00 100,000,000 Free 0 Teen Arcade
9943 Miami crime simulator GAME 4.00 254518 100.00 10,000,000 Free 0 Mature 17+ Action
9944 Gangster Town: Vice District FAMILY 4.30 65146 100.00 10,000,000 Free 0 Mature 17+ Simulation
3144 Vi Trainer HEALTH_AND_FITNESS 3.60 124 100.00 5,000 Free 0 Everyone Health & Fitness
... ... ... ... ... ... ... ... ... ... ...
2648 Ad Remove Plugin for App2SD PRODUCTIVITY 4.10 66 0.02 1,000 Paid $1.29 Everyone Productivity
5798 ExDialer PRO Key COMMUNICATION 4.50 5474 0.02 100,000 Paid $3.99 Everyone Communication
2684 My baby firework (Remove ad) FAMILY 4.10 30 0.01 1,000 Paid $0.99 Everyone Entertainment
7966 Market Update Helper LIBRARIES_AND_DEMO 4.10 20145 0.01 1,000,000 Free 0 Everyone Libraries & Demo
4696 Essential Resources LIBRARIES_AND_DEMO 4.60 237 0.01 50,000 Free 0 Everyone Libraries & Demo

8199 rows × 10 columns

Find the 5 App with Most Reviews¶

Challenge: Which apps have the highest number of reviews? Are there any paid apps among the top 50?

In [20]:
df_apps_clean.sort_values('Reviews', ascending=False)
Out[20]:
App Category Rating Reviews Size_MBs Installs Type Price Content_Rating Genres
10805 Facebook SOCIAL 4.10 78158306 5.30 1,000,000,000 Free 0 Teen Social
10785 WhatsApp Messenger COMMUNICATION 4.40 69119316 3.50 1,000,000,000 Free 0 Everyone Communication
10806 Instagram SOCIAL 4.50 66577313 5.30 1,000,000,000 Free 0 Teen Social
10784 Messenger – Text and Video Chat for Free COMMUNICATION 4.00 56642847 3.50 1,000,000,000 Free 0 Everyone Communication
10650 Clash of Clans GAME 4.60 44891723 98.00 100,000,000 Free 0 Everyone 10+ Strategy
... ... ... ... ... ... ... ... ... ... ...
453 Wowkwis aq Ka'qaquj FAMILY 5.00 1 49.00 10 Free 0 Everyone Education;Education
462 CB Fit HEALTH_AND_FITNESS 5.00 1 7.80 10 Free 0 Everyone Health & Fitness
901 ES Billing System (Offline App) PRODUCTIVITY 5.00 1 4.20 100 Free 0 Everyone Productivity
1416 Ek Kahani Aisi Bhi Season 3 - The Horror Story FAMILY 3.00 1 5.80 100 Free 0 Teen Entertainment
1100 Chenoweth AH MEDICAL 5.00 1 27.00 100 Free 0 Everyone Medical

8199 rows × 10 columns

Plotly Pie and Donut Charts - Visualise Categorical Data: Content Ratings¶

In [21]:
    ratings = df_apps_clean.Content_Rating.value_counts()
    ratings
Out[21]:
Content_Rating
Everyone           6621
Teen                912
Mature 17+          357
Everyone 10+        305
Adults only 18+       3
Unrated               1
Name: count, dtype: int64
In [22]:
fig = px.pie(labels=ratings.index, values=ratings.values)
fig.show()
In [23]:
fig = px.pie(labels=ratings.index,
values=ratings.values,
title="Content Rating",
names=ratings.index,
)
fig.update_traces(textposition='outside', textinfo='percent+label')

fig.show()
In [24]:
fig = px.pie(labels=ratings.index,
values=ratings.values,
title="Content Rating",
names=ratings.index,
hole=0.6,
)
fig.update_traces(textposition='inside', textfont_size=15, textinfo='percent')

fig.show()
# my fav so far

Numeric Type Conversion: Examine the Number of Installs¶

Challenge: How many apps had over 1 billion (that's right - BILLION) installations? How many apps just had a single install?

Check the datatype of the Installs column.

Count the number of apps at each level of installations.

Convert the number of installations (the Installs column) to a numeric data type. Hint: this is a 2-step process. You'll have make sure you remove non-numeric characters first.

In [25]:
df_apps_clean.Installs.describe()
Out[25]:
count          8199
unique           19
top       1,000,000
freq           1417
Name: Installs, dtype: object
In [26]:
df_apps_clean.info()
<class 'pandas.core.frame.DataFrame'>
Index: 8199 entries, 21 to 10835
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             8199 non-null   object 
 1   Category        8199 non-null   object 
 2   Rating          8199 non-null   float64
 3   Reviews         8199 non-null   int64  
 4   Size_MBs        8199 non-null   float64
 5   Installs        8199 non-null   object 
 6   Type            8199 non-null   object 
 7   Price           8199 non-null   object 
 8   Content_Rating  8199 non-null   object 
 9   Genres          8199 non-null   object 
dtypes: float64(2), int64(1), object(7)
memory usage: 704.6+ KB
In [27]:
df_apps_clean[['App', 'Installs']].groupby('Installs').count()
Out[27]:
App
Installs
1 3
1,000 698
1,000,000 1417
1,000,000,000 20
10 69
10,000 988
10,000,000 933
100 303
100,000 1096
100,000,000 189
5 9
5,000 425
5,000,000 607
50 56
50,000 457
50,000,000 202
500 199
500,000 504
500,000,000 24
In [28]:
df_apps_clean.Installs = df_apps_clean.Installs.astype(str).str.replace(',', "")
df_apps_clean.Installs = pd.to_numeric(df_apps_clean.Installs)
df_apps_clean[['App', 'Installs']].groupby('Installs').count()
Out[28]:
App
Installs
1 3
5 9
10 69
50 56
100 303
500 199
1000 698
5000 425
10000 988
50000 457
100000 1096
500000 504
1000000 1417
5000000 607
10000000 933
50000000 202
100000000 189
500000000 24
1000000000 20

Find the Most Expensive Apps, Filter out the Junk, and Calculate a (ballpark) Sales Revenue Estimate¶

Let's examine the Price column more closely.

Challenge: Convert the price column to numeric data. Then investigate the top 20 most expensive apps in the dataset.

Remove all apps that cost more than $250 from the df_apps_clean DataFrame.

Add a column called 'Revenue_Estimate' to the DataFrame. This column should hold the price of the app times the number of installs. What are the top 10 highest grossing paid apps according to this estimate? Out of the top 10 highest grossing paid apps, how many are games?

In [29]:
df_apps_clean.Price = df_apps_clean.Price.astype(str).str.replace('$', "")
df_apps_clean.Price = pd.to_numeric(df_apps_clean.Price)
In [30]:
df_apps_clean.sort_values('Price', ascending=False).head(20)
Out[30]:
App Category Rating Reviews Size_MBs Installs Type Price Content_Rating Genres
3946 I'm Rich - Trump Edition LIFESTYLE 3.60 275 7.30 10000 Paid 400.00 Everyone Lifestyle
2461 I AM RICH PRO PLUS FINANCE 4.00 36 41.00 1000 Paid 399.99 Everyone Finance
4606 I Am Rich Premium FINANCE 4.10 1867 4.70 50000 Paid 399.99 Everyone Finance
3145 I am rich(premium) FINANCE 3.50 472 0.94 5000 Paid 399.99 Everyone Finance
3554 💎 I'm rich LIFESTYLE 3.80 718 26.00 10000 Paid 399.99 Everyone Lifestyle
5765 I am rich LIFESTYLE 3.80 3547 1.80 100000 Paid 399.99 Everyone Lifestyle
1946 I am rich (Most expensive app) FINANCE 4.10 129 2.70 1000 Paid 399.99 Teen Finance
2775 I Am Rich Pro FAMILY 4.40 201 2.70 5000 Paid 399.99 Everyone Entertainment
3221 I am Rich Plus FAMILY 4.00 856 8.70 10000 Paid 399.99 Everyone Entertainment
3114 I am Rich FINANCE 4.30 180 3.80 5000 Paid 399.99 Everyone Finance
1331 most expensive app (H) FAMILY 4.30 6 1.50 100 Paid 399.99 Everyone Entertainment
2394 I am Rich! FINANCE 3.80 93 22.00 1000 Paid 399.99 Everyone Finance
3897 I Am Rich FAMILY 3.60 217 4.90 10000 Paid 389.99 Everyone Entertainment
2193 I am extremely Rich LIFESTYLE 2.90 41 2.90 1000 Paid 379.99 Everyone Lifestyle
3856 I am rich VIP LIFESTYLE 3.80 411 2.60 10000 Paid 299.99 Everyone Lifestyle
2281 Vargo Anesthesia Mega App MEDICAL 4.60 92 32.00 1000 Paid 79.99 Everyone Medical
1407 LTC AS Legal MEDICAL 4.00 6 1.30 100 Paid 39.99 Everyone Medical
2629 I am Rich Person LIFESTYLE 4.20 134 1.80 1000 Paid 37.99 Everyone Lifestyle
2481 A Manual of Acupuncture MEDICAL 3.50 214 68.00 1000 Paid 33.99 Everyone Medical
4264 Golfshot Plus: Golf GPS SPORTS 4.10 3387 25.00 50000 Paid 29.99 Everyone Sports

The most expensive apps sub $250¶

In [31]:
df_apps_clean = df_apps_clean[df_apps_clean['Price'] < 250]
df_apps_clean.sort_values('Price', ascending=False).head(5)
Out[31]:
App Category Rating Reviews Size_MBs Installs Type Price Content_Rating Genres
2281 Vargo Anesthesia Mega App MEDICAL 4.60 92 32.00 1000 Paid 79.99 Everyone Medical
1407 LTC AS Legal MEDICAL 4.00 6 1.30 100 Paid 39.99 Everyone Medical
2629 I am Rich Person LIFESTYLE 4.20 134 1.80 1000 Paid 37.99 Everyone Lifestyle
2481 A Manual of Acupuncture MEDICAL 3.50 214 68.00 1000 Paid 33.99 Everyone Medical
2463 PTA Content Master MEDICAL 4.20 64 41.00 1000 Paid 29.99 Everyone Medical

Highest Grossing Paid Apps (ballpark estimate)¶

In [32]:
df_apps_clean['Revenue_Estimate'] = df_apps_clean.Installs.mul(df_apps_clean.Price)
df_apps_clean.sort_values('Revenue_Estimate', ascending=False)[:10]
Out[32]:
App Category Rating Reviews Size_MBs Installs Type Price Content_Rating Genres Revenue_Estimate
9220 Minecraft FAMILY 4.50 2376564 19.00 10000000 Paid 6.99 Everyone 10+ Arcade;Action & Adventure 69,900,000.00
8825 Hitman Sniper GAME 4.60 408292 29.00 10000000 Paid 0.99 Mature 17+ Action 9,900,000.00
7151 Grand Theft Auto: San Andreas GAME 4.40 348962 26.00 1000000 Paid 6.99 Mature 17+ Action 6,990,000.00
7477 Facetune - For Free PHOTOGRAPHY 4.40 49553 48.00 1000000 Paid 5.99 Everyone Photography 5,990,000.00
7977 Sleep as Android Unlock LIFESTYLE 4.50 23966 0.85 1000000 Paid 5.99 Everyone Lifestyle 5,990,000.00
6594 DraStic DS Emulator GAME 4.60 87766 12.00 1000000 Paid 4.99 Everyone Action 4,990,000.00
6082 Weather Live WEATHER 4.50 76593 4.75 500000 Paid 5.99 Everyone Weather 2,995,000.00
7954 Bloons TD 5 FAMILY 4.60 190086 94.00 1000000 Paid 2.99 Everyone Strategy 2,990,000.00
7633 Five Nights at Freddy's GAME 4.60 100805 50.00 1000000 Paid 2.99 Teen Action 2,990,000.00
6746 Card Wars - Adventure Time FAMILY 4.30 129603 23.00 1000000 Paid 2.99 Everyone 10+ Card;Action & Adventure 2,990,000.00

Plotly Bar Charts & Scatter Plots: Analysing App Categories¶

In [33]:
df_apps_clean.Category.nunique()
Out[33]:
33
In [34]:
top10 = df_apps_clean.Category.value_counts()[:10]
top10
Out[34]:
Category
FAMILY             1606
GAME                910
TOOLS               719
PRODUCTIVITY        301
PERSONALIZATION     298
LIFESTYLE           297
FINANCE             296
MEDICAL             292
PHOTOGRAPHY         263
BUSINESS            262
Name: count, dtype: int64
In [ ]:
 

Vertical Bar Chart - Highest Competition (Number of Apps)¶

In [35]:
bar = px.bar(x = top10.index, y = top10.values)
bar.show()

Horizontal Bar Chart - Most Popular Categories (Highest Downloads)¶

In [36]:
category_installs = df_apps_clean.groupby('Category').agg({'Installs': pd.Series.sum})
category_installs.sort_values('Installs', ascending=True, inplace=True)
In [37]:
h_bar = px.bar(x = category_installs.Installs,
               y = category_installs.index,
               orientation='h',
               title='Category Popularity')
h_bar.update_layout(xaxis_title='Number of Downloads', yaxis_title='Category')
h_bar.show()

Category Concentration - Downloads vs. Competition¶

Challenge:

  • First, create a DataFrame that has the number of apps in one column and the number of installs in another:
  • Then use the plotly express examples from the documentation alongside the .scatter() API referenceto create scatter plot that looks like this.

Hint: Use the size, hover_name and color parameters in .scatter(). To scale the yaxis, call .update_layout() and specify that the yaxis should be on a log-scale like so: yaxis=dict(type='log')

In [47]:
cat_df = df_apps_clean.groupby('Category').agg({'App':pd.Series.count,
                                                  'Installs': pd.Series.sum
                                                 })
cat_df.sort_values('Installs', ascending=False, inplace=True)
cat_df
Out[47]:
App Installs
Category
GAME 910 13858762717
COMMUNICATION 257 11039241530
TOOLS 719 8099724500
PRODUCTIVITY 301 5788070180
SOCIAL 203 5487841475
PHOTOGRAPHY 263 4649143130
FAMILY 1606 4437554490
VIDEO_PLAYERS 148 3916897200
TRAVEL_AND_LOCAL 187 2894859300
NEWS_AND_MAGAZINES 204 2369110650
ENTERTAINMENT 102 2113660000
BOOKS_AND_REFERENCE 169 1665791655
PERSONALIZATION 298 1532352930
SHOPPING 180 1400331540
HEALTH_AND_FITNESS 243 1134006220
SPORTS 260 1096431465
BUSINESS 262 692018120
LIFESTYLE 297 503611120
MAPS_AND_NAVIGATION 118 503267560
FINANCE 296 455249400
WEATHER 72 361096500
EDUCATION 118 352852000
FOOD_AND_DRINK 94 211677750
DATING 134 140912410
ART_AND_DESIGN 61 114233100
HOUSE_AND_HOME 62 97082000
AUTO_AND_VEHICLES 73 53129800
LIBRARIES_AND_DEMO 64 52083000
COMICS 54 44931100
MEDICAL 292 39162676
PARENTING 50 31116110
BEAUTY 42 26916200
EVENTS 45 15949410
In [48]:
scatter = px.scatter(cat_df, # data
                    x='App', # column name
                    y='Installs',
                    title='Category Concentration',
                    size='App',
                    hover_name=cat_df.index,
                    color='Installs')

scatter.update_layout(xaxis_title="Number of Apps (Lower=More Concentrated)",
                      yaxis_title="Installs",
                      yaxis=dict(type='log'))

scatter.show()

Extracting Nested Data from a Column¶

Challenge: How many different types of genres are there? Can an app belong to more than one genre? Check what happens when you use .value_counts() on a column with nested values? See if you can work around this problem by using the .split() function and the DataFrame's .stack() method.

In [49]:
df_apps_clean.Genres.value_counts().sort_values(ascending=True)[:5]
Out[49]:
Genres
Lifestyle;Pretend Play      1
Strategy;Education          1
Adventure;Education         1
Role Playing;Brain Games    1
Tools;Education             1
Name: count, dtype: int64
In [50]:
stack = df_apps_clean.Genres.str.split(';', expand=True).stack()
print(f'We now have a single column with shape: {stack.shape}')
num_genres = stack.value_counts()
print(f'Number of genres: {len(num_genres)}')
We now have a single column with shape: (8564,)
Number of genres: 53

Colour Scales in Plotly Charts - Competition in Genres¶

Challenge: Can you create this chart with the Series containing the genre data?

Try experimenting with the built in colour scales in Plotly. You can find a full list here.

  • Find a way to set the colour scale using the color_continuous_scale parameter.
  • Find a way to make the color axis disappear by using coloraxis_showscale.
In [51]:
bar = px.bar(x = num_genres.index[:15], y = num_genres.values[:15],
            title='Top Genres',
            hover_name=num_genres.index[:15],
            color=num_genres.values[:15],
            color_continuous_scale='Agsunset')

bar.update_layout(xaxis_title='Genre',
yaxis_title='Number of Apps',
coloraxis_showscale=False)

bar.show()

Grouped Bar Charts: Free vs. Paid Apps per Category¶

In [52]:
df_apps_clean.Type.value_counts()
Out[52]:
Type
Free    7595
Paid     589
Name: count, dtype: int64
In [53]:
df_free_vs_paid = df_apps_clean.groupby(["Category", "Type"], as_index=False).agg({'App': pd.Series.count})
df_free_vs_paid
Out[53]:
Category Type App
0 ART_AND_DESIGN Free 58
1 ART_AND_DESIGN Paid 3
2 AUTO_AND_VEHICLES Free 72
3 AUTO_AND_VEHICLES Paid 1
4 BEAUTY Free 42
... ... ... ...
56 TRAVEL_AND_LOCAL Paid 8
57 VIDEO_PLAYERS Free 144
58 VIDEO_PLAYERS Paid 4
59 WEATHER Free 65
60 WEATHER Paid 7

61 rows × 3 columns

Challenge: Use the plotly express bar chart examples and the .bar() API reference to create this bar chart:

You'll want to use the df_free_vs_paid DataFrame that you created above that has the total number of free and paid apps per category.

See if you can figure out how to get the look above by changing the categoryorder to 'total descending' as outlined in the documentation here here.

In [54]:
g_bar = px.bar(df_free_vs_paid,
               x='Category',
               y='App',
               title='Free vs Paid Apps by Category',
               color='Type',
               barmode='group')

g_bar.update_layout(xaxis_title='Category',
                    yaxis_title='Number of Apps',
                    xaxis={'categoryorder':'total descending'},
                    yaxis=dict(type='log'))

g_bar.show()

Plotly Box Plots: Lost Downloads for Paid Apps¶

Challenge: Create a box plot that shows the number of Installs for free versus paid apps. How does the median number of installations compare? Is the difference large or small?

Use the Box Plots Guide and the .box API reference to create the following chart.

In [55]:
fig_downloads = px.box(df_apps_clean, 
             x='Type', 
             y='Installs',
             color='Type',
             notched=True,
             points='all',
             title='How Many Downloads are Paid Apps Giving Up?')
fig_downloads.update_layout(yaxis=dict(type='log'))
 
fig_downloads.show()

Plotly Box Plots: Revenue by App Category¶

Challenge: See if you can generate the chart below:

Looking at the hover text, how much does the median app earn in the Tools category? If developing an Android app costs $30,000 or thereabouts, does the average photography app recoup its development costs?

Hint: I've used 'min ascending' to sort the categories.

In [56]:
df_paid_apps = df_apps_clean[df_apps_clean['Type'] == 'Paid']

box = px.box(df_paid_apps, 
             x='Category', 
             y='Revenue_Estimate',
             title='How Much Can Paid Apps Earn?')

box.update_layout(xaxis_title='Category',
                  yaxis_title='Paid App Ballpark Revenue',
                  xaxis={'categoryorder':'min ascending'},
                  yaxis=dict(type='log'))


box.show()

How Much Can You Charge? Examine Paid App Pricing Strategies by Category¶

Challenge: What is the median price price for a paid app? Then compare pricing by category by creating another box plot. But this time examine the prices (instead of the revenue estimates) of the paid apps. I recommend using {categoryorder':'max descending'} to sort the categories.

In [57]:
df_paid_apps.Price.median()
Out[57]:
2.99
In [62]:
box = px.box(df_paid_apps,
             x='Category',
             y="Price",
             title='Price per Category')

box.update_layout(xaxis_title='Category',
                  yaxis_title='Paid App Price',
                  xaxis={'categoryorder':'max descending'},
                  yaxis=dict(type='log'))

box.show()
In [ ]:
 
In [ ]:
 
In [ ]: